Results 1 - 20 of 74
1.
Bioinformatics ; 40(3), 2024 Mar 04.
Article in English | MEDLINE | ID: mdl-38478393

ABSTRACT

SUMMARY: Knowledge of immunoglobulin and T cell receptor encoding genes is derived from high-quality genomic sequencing. High-throughput sequencing is delivering large volumes of data, and precise, high-throughput approaches to annotation are needed. Digger is an automated tool that identifies coding and regulatory regions of these genes, with results comparable to those obtained by current expert curational methods. AVAILABILITY AND IMPLEMENTATION: Digger is published under open source license at https://github.com/williamdlees/Digger and is available as a Python package and a Docker container.


Subjects
Receptors, Antigen, T-Cell , Software , Receptors, Antigen, T-Cell/genetics , Chromosome Mapping , Immunoglobulins/genetics , High-Throughput Nucleotide Sequencing/methods
2.
bioRxiv ; 2024 Jan 28.
Article in English | MEDLINE | ID: mdl-38293151

ABSTRACT

Adaptive Immune Receptor Repertoire sequencing (AIRR-seq) is a valuable experimental tool to study the immune state in health and following immune challenges such as infectious diseases, (auto)immune diseases, and cancer. Several tools have been developed to reconstruct B cell and T cell receptor sequences from AIRR-seq data and infer B and T cell clonal relationships. However, currently available tools offer limited parallelization across samples, scalability or portability to high-performance computing infrastructures. To address this need, we developed nf-core/airrflow, an end-to-end bulk and single-cell AIRR-seq processing workflow which integrates the Immcantation Framework following BCR and TCR sequencing data analysis best practices. The Immcantation Framework is a comprehensive toolset, which allows the processing of bulk and single-cell AIRR-seq data from raw read processing to clonal inference. nf-core/airrflow is written in Nextflow and is part of the nf-core project, which collects community contributed and curated Nextflow workflows for a wide variety of analysis tasks. We assessed the performance of nf-core/airrflow on simulated sequencing data with sequencing errors and show example results with real datasets. To demonstrate the applicability of nf-core/airrflow to the high-throughput processing of large AIRR-seq datasets, we validated and extended previously reported findings of convergent antibody responses to SARS-CoV-2 by analyzing 97 COVID-19 infected individuals and 99 healthy controls, including a mixture of bulk and single-cell sequencing datasets. Using this dataset, we extended the convergence findings to 20 additional subjects, highlighting the applicability of nf-core/airrflow to validate findings in small in-house cohorts with reanalysis of large publicly available AIRR datasets. nf-core/airrflow is available free of charge, under the MIT license on GitHub (https://github.com/nf-core/airrflow). 
Detailed documentation and example results are available on the nf-core website (https://nf-co.re/airrflow).

3.
Cell Rep ; 42(8): 112879, 2023 08 29.
Article in English | MEDLINE | ID: mdl-37537844

ABSTRACT

Neuroblastoma is a lethal childhood solid tumor of developing peripheral nerves. Two percent of children with neuroblastoma develop opsoclonus myoclonus ataxia syndrome (OMAS), a paraneoplastic disease characterized by cerebellar and brainstem-directed autoimmunity but typically with outstanding cancer-related outcomes. We compared tumor transcriptomes and tumor-infiltrating T and B cell repertoires from 38 OMAS subjects with neuroblastoma to 26 non-OMAS-associated neuroblastomas. We found greater B and T cell infiltration in OMAS-associated tumors compared to controls and showed that both were polyclonal expansions. Tertiary lymphoid structures (TLSs) were enriched in OMAS-associated tumors. We identified significant enrichment of the major histocompatibility complex (MHC) class II allele HLA-DOB*01:01 in OMAS patients. OMAS severity scores were associated with the expression of several candidate autoimmune genes. We propose a model in which polyclonal auto-reactive B lymphocytes act as antigen-presenting cells and drive TLS formation, thereby supporting both sustained polyclonal T cell-mediated anti-tumor immunity and paraneoplastic OMAS neuropathology.


Subjects
Neuroblastoma , Opsoclonus-Myoclonus Syndrome , Child , Humans , Autoimmunity , Neuroblastoma/complications , Neuroblastoma/metabolism , Opsoclonus-Myoclonus Syndrome/complications , Opsoclonus-Myoclonus Syndrome/pathology , Autoantibodies , Genes, MHC Class II , Ataxia
4.
Nucleic Acids Res ; 51(16): e86, 2023 09 08.
Article in English | MEDLINE | ID: mdl-37548401

ABSTRACT

In adaptive immune receptor repertoire analysis, determining the germline variable (V) allele associated with each T- and B-cell receptor sequence is a crucial step. This process is highly impacted by allele annotations. Aligning sequences, assigning them to specific germline alleles, and inferring individual genotypes are challenging when the repertoire is highly mutated, or sequence reads do not cover the whole V region. Here, we propose an alternative naming scheme for the V alleles, as well as a novel method to infer individual genotypes. We demonstrate the strengths of the two by comparing their outcomes to other genotype inference methods. We validate the genotype approach with independent genomic long-read data. The naming scheme is compatible with current annotation tools and pipelines. Analysis results can be converted from the proposed naming scheme to the nomenclature determined by the International Union of Immunological Societies (IUIS). Both the naming scheme and the genotype procedure are implemented in a freely available R package (PIgLET https://bitbucket.org/yaarilab/piglet). To allow researchers to further explore the approach on real data and to adapt it for their uses, we also created an interactive website (https://yaarilab.github.io/IGHV_reference_book).


Subjects
Genomics , Immunoglobulin Heavy Chains , Receptors, Antigen, B-Cell , Alleles , Genotype , Receptors, Antigen, B-Cell/genetics , Immunoglobulin Heavy Chains/genetics
5.
Bioinformatics ; 39(7), 2023 07 01.
Article in English | MEDLINE | ID: mdl-37417959

ABSTRACT

MOTIVATION: T-cell receptor beta chain (TCRB) repertoires are crucial for understanding immune responses. However, their high diversity and complexity present significant challenges in representation and analysis. The main motivation of this study is to develop a unified and compact representation of a TCRB repertoire that can efficiently capture its inherent complexity and diversity and allow for direct inference. RESULTS: We introduce a novel approach to TCRB repertoire encoding and analysis, leveraging the Lempel-Ziv 76 algorithm. This approach allows us to create a graph-like model, identify specific sequence features, and produce a new encoding approach for an individual's repertoire. The proposed representation enables various applications, including generation probability inference, informative feature vector derivation, sequence generation, a new measure for diversity estimation, and a new sequence centrality measure. The approach was applied to four large-scale public TCRB sequencing datasets, demonstrating its potential for a wide range of applications in big biological sequencing data. AVAILABILITY AND IMPLEMENTATION: Python package for implementation is available at https://github.com/MuteJester/LZGraphs.
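
As a rough illustration of the kind of parsing the Lempel-Ziv 76 algorithm performs, the sketch below splits a sequence into phrases, each being the shortest substring that has not yet occurred in the text preceding its last character. It is a textbook exhaustive-history parse written for this note; the function name `lz76_phrases` is hypothetical, and the LZGraphs package may decompose sequences differently.

```python
def lz76_phrases(s: str) -> list[str]:
    """Split s into Lempel-Ziv 76 phrases (exhaustive-history parsing).

    Hypothetical helper for illustration only; not taken from LZGraphs.
    """
    phrases = []
    i, n = 0, len(s)
    while i < n:
        k = 1
        # Grow the candidate phrase while it can still be copied from
        # the text that precedes its final character.
        while i + k <= n and s[i:i + k] in s[:i + k - 1]:
            k += 1
        # If the end of the string was reached, the slice is simply
        # the (possibly already seen) remainder.
        phrases.append(s[i:i + k])
        i += k
    return phrases


# Classic binary example from the Lempel-Ziv complexity literature.
print(lz76_phrases("0001101001000101"))
# → ['0', '001', '10', '100', '1000', '101']
```

Applied to an amino acid string such as a CDR3 sequence, the same function yields a list of sub-patterns that could serve as nodes in a graph-like model; how the package actually builds its graph is not described in the abstract.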


Subjects
Data Compression , Receptors, Antigen, T-Cell, alpha-beta , Receptors, Antigen, T-Cell, alpha-beta/genetics , Algorithms , Receptors, Antigen, T-Cell/genetics
6.
Article in English | MEDLINE | ID: mdl-37388275

ABSTRACT

Analysis of an individual's immunoglobulin or T cell receptor gene repertoire can provide important insights into immune function. High-quality analysis of adaptive immune receptor repertoire sequencing data depends upon accurate and relatively complete germline sets, but current sets are known to be incomplete. Established processes for the review and systematic naming of receptor germline genes and alleles require specific evidence and data types, but the discovery landscape is rapidly changing. To exploit the potential of emerging data, and to provide the field with improved state-of-the-art germline sets, an intermediate approach is needed that will allow the rapid publication of consolidated sets derived from these emerging sources. These sets must use a consistent naming scheme and allow refinement and consolidation into genes as new information emerges. Name changes should be minimised, but, where changes occur, the naming history of a sequence must be traceable. Here we outline the current issues and opportunities for the curation of germline IG/TR genes and present a forward-looking data model for building out more robust germline sets that can dovetail with current established processes. We describe interoperability standards for germline sets, and an approach to transparency based on principles of findability, accessibility, interoperability, and reusability.

7.
Front Immunol ; 14: 1031914, 2023.
Article in English | MEDLINE | ID: mdl-37153628

ABSTRACT

Introduction: The success of the human body in fighting SARS-CoV2 infection relies on lymphocytes and their antigen receptors. Identifying and characterizing clinically relevant receptors is of utmost importance. Methods: We report here the application of a machine learning approach, utilizing B cell receptor repertoire sequencing data from severely and mildly infected individuals with SARS-CoV2 compared with uninfected controls. Results: In contrast to previous studies, our approach successfully stratifies non-infected from infected individuals, as well as disease level of severity. The features that drive this classification are based on somatic hypermutation patterns, and point to alterations in the somatic hypermutation process in COVID-19 patients. Discussion: These features may be used to build and adapt therapeutic strategies to COVID-19, in particular to quantitatively assess potential diagnostic and therapeutic antibodies. These results constitute a proof of concept for future epidemiological challenges.


Subjects
B-Lymphocytes , COVID-19 , Humans , Receptors, Antigen, B-Cell/genetics , RNA, Viral , SARS-CoV-2/genetics , Patient Acuity
8.
J Immunol ; 210(10): 1607-1619, 2023 05 15.
Article in English | MEDLINE | ID: mdl-37027017

ABSTRACT

Current Adaptive Immune Receptor Repertoire sequencing (AIRR-seq) using short-read sequencing strategies resolves expressed Ab transcripts with limited resolution of the C region. In this article, we present the near-full-length AIRR-seq (FLAIRR-seq) method that uses targeted amplification by 5' RACE, combined with single-molecule, real-time sequencing to generate highly accurate (99.99%) human Ab H chain transcripts. FLAIRR-seq was benchmarked by comparing H chain V (IGHV), D (IGHD), and J (IGHJ) gene usage, complementarity-determining region 3 length, and somatic hypermutation to matched datasets generated with standard 5' RACE AIRR-seq using short-read sequencing and full-length isoform sequencing. Together, these data demonstrate robust FLAIRR-seq performance using RNA samples derived from PBMCs, purified B cells, and whole blood, which recapitulated results generated by commonly used methods, while additionally resolving H chain gene features not documented in IMGT at the time of submission. FLAIRR-seq data provide, for the first time, to our knowledge, simultaneous single-molecule characterization of IGHV, IGHD, IGHJ, and IGHC region genes and alleles, allele-resolved subisotype definition, and high-resolution identification of class switch recombination within a clonal lineage. In conjunction with genomic sequencing and genotyping of IGHC genes, FLAIRR-seq of the IgM and IgG repertoires from 10 individuals resulted in the identification of 32 unique IGHC alleles, 28 (87%) of which were previously uncharacterized. Together, these data demonstrate the capabilities of FLAIRR-seq to characterize IGHV, IGHD, IGHJ, and IGHC gene diversity for the most comprehensive view of bulk-expressed Ab repertoires to date.


Subjects
Complementarity Determining Regions , Humans , Complementarity Determining Regions/genetics , Base Sequence
9.
Nat Commun ; 14(1): 1462, 2023 03 16.
Article in English | MEDLINE | ID: mdl-36927854

ABSTRACT

Protection from viral infections depends on immunoglobulin isotype switching, which endows antibodies with effector functions. Here, we find that the protein kinase DYRK1A is essential for B cell-mediated protection from viral infection and effective vaccination through regulation of class switch recombination (CSR). Dyrk1a-deficient B cells are impaired in CSR activity in vivo and in vitro. Phosphoproteomic screens and kinase-activity assays identify MSH6, a DNA mismatch repair protein, as a direct substrate for DYRK1A, and deletion of a single phosphorylation site impaired CSR. After CSR and germinal center (GC) seeding, DYRK1A is required for attenuation of B cell proliferation. These findings demonstrate DYRK1A-mediated biological mechanisms of B cell immune responses that may be used for therapeutic manipulation in antibody-mediated autoimmunity.


Subjects
B-Lymphocytes , Immunoglobulin Class Switching , Phosphorylation , Immunoglobulin Class Switching/genetics , Germinal Center , DNA-Binding Proteins/genetics , DNA-Binding Proteins/metabolism
10.
Cell Mol Gastroenterol Hepatol ; 16(1): 63-81, 2023.
Article in English | MEDLINE | ID: mdl-36965814

ABSTRACT

BACKGROUND & AIMS: Hepatocellular carcinoma (HCC) is a model of a diverse spectrum of cancers because it is induced by well-known etiologies, mainly hepatitis C virus (HCV) and hepatitis B virus. Here, we aimed to identify HCV-specific mutational signatures and explored the link between the HCV-related regional variation in mutation rates and HCV-induced alterations in genome-wide chromatin organization. METHODS: To identify an HCV-specific mutational signature in HCC, we performed high-resolution targeted sequencing to detect passenger mutations on 64 HCC samples from 3 etiology groups: hepatitis B virus, HCV, or other. To explore the link between the genomic signature and genome-wide chromatin organization we performed chromatin immunoprecipitation sequencing for the transcriptionally permissive H3K4Me3, H3K9Ac, and suppressive H3K9Me3 modifications after HCV infection. RESULTS: Regional variation in mutation rate analysis showed significant etiology-dependent regional mutation rates in 12 genes: LRP2, KRT84, TMEM132B, DOCK2, DMD, INADL, JAK2, DNAH6, MTMR9, ATM, SLX4, and ARSD. We found an enrichment of C->T transversion mutations in the HCV-associated HCC cases. Furthermore, these cases showed regional variation in mutation rates associated with genomic intervals in which HCV infection dictated epigenetic alterations. This signature may be related to the HCV-induced decreased expression of genes encoding key enzymes in the base excision repair pathway. CONCLUSIONS: We identified novel distinct HCV etiology-dependent mutation signatures in HCC associated with HCV-induced alterations in histone modification. This study presents a link between cancer-causing mutagenesis and the increased predisposition to liver cancer in chronic HCV-infected individuals, and unveils novel etiology-specific mechanisms leading to HCC and cancer in general.


Subjects
Carcinoma, Hepatocellular , Hepatitis C , Liver Neoplasms , Humans , Liver Neoplasms/pathology , Carcinoma, Hepatocellular/pathology , Hepatitis C/complications , Hepatitis C/genetics , Mutation/genetics , Hepacivirus/genetics , Hepatitis B virus/genetics , Epigenesis, Genetic/genetics , Chromatin , Genomics , Protein Tyrosine Phosphatases, Non-Receptor/genetics , Keratins, Type II/genetics , Keratins, Hair-Specific/genetics
11.
Genome Res ; 33(1): 71-79, 2023 01.
Article in English | MEDLINE | ID: mdl-36526432

ABSTRACT

Crohn's disease (CD) is a chronic relapsing-remitting inflammatory disorder of the gastrointestinal tract that is characterized by altered innate and adaptive immune function. Although massively parallel sequencing studies of the T cell receptor repertoire identified oligoclonal expansion of unique clones, much less is known about the B cell receptor (BCR) repertoire in CD. Here, we present a novel BCR repertoire sequencing data set from ileal biopsies from pediatric patients with CD and controls, and identify CD-specific somatic hypermutation (SHM) patterns, revealed by a machine learning (ML) algorithm trained on BCR repertoire sequences. Moreover, ML classification of a different data set from blood samples of adults with CD versus controls identified that V gene usage, clusters, or mutation frequencies yielded excellent results in classifying the disease (F1 > 90%). In summary, we show that an ML algorithm enables the classification of CD based on unique BCR repertoire features with high accuracy.


Subjects
Crohn Disease , Adult , Humans , Child , Crohn Disease/genetics , Machine Learning , Biopsy , Algorithms , Chronic Disease
12.
Front Immunol ; 14: 1330153, 2023.
Article in English | MEDLINE | ID: mdl-38406579

ABSTRACT

Introduction: Analysis of an individual's immunoglobulin (IG) gene repertoire requires the use of high-quality germline gene reference sets. When sets only contain alleles supported by strong evidence, AIRR sequencing (AIRR-seq) data analysis is more accurate and studies of the evolution of IG genes, their allelic variants and the expressed immune repertoire are therefore facilitated. Methods: The Adaptive Immune Receptor Repertoire Community (AIRR-C) IG Reference Sets have been developed by including only human IG heavy and light chain alleles that have been confirmed by evidence from multiple high-quality sources. To further improve AIRR-seq analysis, some alleles have been extended to deal with short 3' or 5' truncations that can lead them to be overlooked by alignment utilities. To avoid other challenges for analysis programs, exact paralogs (e.g. IGHV1-69*01 and IGHV1-69D*01) are only represented once in each set, though alternative sequence names are noted in accompanying metadata. Results and discussion: The Reference Sets include fewer than half the previously recognised IG alleles (e.g. just 198 IGHV sequences), and also include a number of novel alleles: 8 IGHV alleles, 2 IGKV alleles and 5 IGLV alleles. Despite their smaller sizes, erroneous calls were eliminated, and excellent coverage was achieved when a set of repertoires comprising over 4 million V(D)J rearrangements from 99 individuals was analyzed using the Sets. The version-tracked AIRR-C IG Reference Sets are freely available at the OGRDB website (https://ogrdb.airr-community.org/germline_sets/Human) and will be regularly updated to include newly observed and previously reported sequences that can be confirmed by new high-quality data.
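
The idea of representing exact paralogs once while keeping their alternative names can be sketched as a simple grouping by sequence. This is a minimal illustration, not the AIRR-C procedure: the function name `collapse_exact_duplicates` is hypothetical and the nucleotide strings are short placeholders, not real germline sequences.

```python
def collapse_exact_duplicates(named_seqs: dict[str, str]) -> dict[str, list[str]]:
    """Group allele names that share an identical sequence.

    Returns a mapping from the first name seen for each sequence (the
    representative) to the list of alternative names with that sequence.
    Hypothetical helper for illustration only.
    """
    by_sequence: dict[str, list[str]] = {}
    for name, seq in named_seqs.items():
        # Compare case-insensitively so "acgt" and "ACGT" count as identical.
        by_sequence.setdefault(seq.upper(), []).append(name)
    return {names[0]: names[1:] for names in by_sequence.values()}


# Placeholder sequences (assumption): the two IGHV1-69 names share one string.
toy_set = {
    "IGHV1-69*01": "CAGGTGCAGCTG",
    "IGHV1-69D*01": "caggtgcagctg",
    "IGHV3-23*01": "GAGGTGCAGCTG",
}
print(collapse_exact_duplicates(toy_set))
# → {'IGHV1-69*01': ['IGHV1-69D*01'], 'IGHV3-23*01': []}
```

The alternative names in each list correspond to what the abstract describes as being noted in accompanying metadata.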


Subjects
Genes, Immunoglobulin , Immunoglobulins , Humans , Immunoglobulins/genetics , Alleles , V(D)J Recombination/genetics , Germ Cells
13.
Front Immunol ; 13: 888555, 2022.
Article in English | MEDLINE | ID: mdl-35720344

ABSTRACT

The immunoglobulin genes of inbred mouse strains that are commonly used in models of antibody-mediated human diseases are poorly characterized. This compromises data analysis. To infer the immunoglobulin genes of BALB/c mice, we used long-read SMRT sequencing to amplify VDJ-C sequences from F1 (BALB/c x C57BL/6) hybrid animals. Strain variations were identified in the Ighm and Ighg2b genes, and analysis of VDJ rearrangements led to the inference of 278 germline IGHV alleles. 169 alleles are not present in the C57BL/6 genome reference sequence. To establish a set of expressed BALB/c IGHV germline gene sequences, we computationally retrieved IGHV haplotypes from the IgM dataset. Haplotyping led to the confirmation of 162 BALB/c IGHV gene sequences. A musIGHV398 pseudogene variant also appears to be present in the BALB/cByJ substrain, while a functional musIGHV398 gene is highly expressed in the BALB/cJ substrain. Only four of the BALB/c alleles were also observed in the C57BL/6 haplotype. The full set of inferred BALB/c sequences has been used to establish a BALB/c IGHV reference set, hosted at https://ogrdb.airr-community.org. We assessed whether assemblies from the Mouse Genome Project (MGP) are suitable for the determination of the genes of the IGH loci. Only 37 (43.5%) of the 85 confirmed IMGT-named BALB/c IGHV and 33 (42.9%) of the 77 confirmed non-IMGT IGHV were found in a search of the MGP BALB/cJ genome assembly. This suggests that current MGP assemblies are unsuitable for the comprehensive documentation of germline IGHVs and more efforts will be needed to establish strain-specific reference sets.


Subjects
Immunoglobulin Heavy Chains , Immunoglobulin Variable Region , Animals , Haplotypes , Immunoglobulin Heavy Chains/genetics , Immunoglobulin Variable Region/genetics , Mice , Mice, Inbred BALB C , Mice, Inbred C57BL , Sequence Analysis, DNA
14.
J Immunol ; 208(12): 2713-2725, 2022 06 15.
Article in English | MEDLINE | ID: mdl-35623663

ABSTRACT

The immune system matures throughout childhood to achieve full functionality in protecting our bodies against threats. The immune system has a strong reciprocal symbiosis with the host bacterial population and the two systems co-develop, shaping each other. Despite their fundamental role in health physiology, the ontogeny of these systems is poorly characterized. In this study, we investigated the development of the BCR repertoire by analyzing high-throughput sequencing of their receptors in several time points of young C57BL/6J mice. In parallel, we explored the development of the gut microbiome. We discovered that the gut IgA repertoires change from birth to adolescence, including an increase in CDR3 lengths and somatic hypermutation levels. This contrasts with the spleen IgM repertoires that remain stable and distinct from the IgA repertoires in the gut. We also discovered that large clones that germinate in the gut are initially confined to a specific gut compartment, then expand to nearby compartments and later on expand also to the spleen and remain there. Finally, we explored the associations between diversity indices of the B cell repertoires and the microbiome, as well as associations between bacterial and BCR clusters. Our results shed light on the ontogeny of the adaptive immune system and the microbiome, providing a baseline for future research.


Subjects
Microbiota , Animals , High-Throughput Nucleotide Sequencing , Immunoglobulin A/genetics , Mice , Mice, Inbred C57BL , Receptors, Antigen, B-Cell/genetics
15.
Cell ; 185(7): 1208-1222.e21, 2022 03 31.
Article in English | MEDLINE | ID: mdl-35305314

ABSTRACT

The tumor microenvironment hosts antibody-secreting cells (ASCs) associated with a favorable prognosis in several types of cancer. Patient-derived antibodies have diagnostic and therapeutic potential; yet, it remains unclear how antibodies gain autoreactivity and target tumors. Here, we found that somatic hypermutations (SHMs) promote antibody antitumor reactivity against surface autoantigens in high-grade serous ovarian carcinoma (HGSOC). Patient-derived tumor cells were frequently coated with IgGs. Intratumoral ASCs in HGSOC were both mutated and clonally expanded and produced tumor-reactive antibodies that targeted MMP14, which is abundantly expressed on the tumor cell surface. The reversion of monoclonal antibodies to their germline configuration revealed two types of classes: one dependent on SHMs for tumor binding and a second with germline-encoded autoreactivity. Thus, tumor-reactive autoantibodies are either naturally occurring or evolve through an antigen-driven selection process. These findings highlight the origin and potential applicability of autoantibodies directed at surface antigens for tumor targeting in cancer patients.


Subjects
Antibodies, Neoplasm , Ovarian Neoplasms , Antibodies, Monoclonal , Autoantibodies , Autoantigens , Female , Humans , Ovarian Neoplasms/genetics , Tumor Microenvironment
16.
Genome Med ; 14(1): 2, 2022 01 07.
Article in English | MEDLINE | ID: mdl-34991709

ABSTRACT

BACKGROUND: T and B cell receptor (TCR, BCR) repertoires constitute the foundation of adaptive immunity. Adaptive immune receptor repertoire sequencing (AIRR-seq) is a common approach to study immune system dynamics. Understanding the genetic factors influencing the composition and dynamics of these repertoires is of major scientific and clinical importance. The chromosomal loci encoding for the variable regions of TCRs and BCRs are challenging to decipher due to repetitive elements and undocumented structural variants. METHODS: To confront this challenge, AIRR-seq-based methods have recently been developed for B cells, enabling genotype and haplotype inference and discovery of undocumented alleles. However, this approach relies on complete coverage of the receptors' variable regions, whereas most T cell studies sequence a small fraction of that region. Here, we adapted a B cell pipeline for undocumented alleles, genotype, and haplotype inference for full and partial AIRR-seq TCR data sets. The pipeline also deals with gene assignment ambiguities, which is especially important in the analysis of data sets of partial sequences. RESULTS: From the full and partial AIRR-seq TCR data sets, we identified 39 undocumented polymorphisms in T cell receptor Beta V (TRBV) and 31 undocumented 5' UTR sequences. A subset of these inferences was also observed using independent genomic approaches. We found that a single nucleotide polymorphism differentiating between the two documented T cell receptor Beta D2 (TRBD2) alleles is strongly associated with dramatic changes in the expressed repertoire. CONCLUSIONS: We reveal a rich picture of germline variability and demonstrate how a single nucleotide polymorphism dramatically affects the composition of the whole repertoire. Our findings provide a basis for annotation of TCR repertoires for future basic and clinical studies.


Subjects
High-Throughput Nucleotide Sequencing , Receptors, Antigen, T-Cell, alpha-beta , Alleles , Germ Cells , High-Throughput Nucleotide Sequencing/methods , Humans , Receptors, Antigen, T-Cell/genetics , Receptors, Antigen, T-Cell, alpha-beta/genetics
17.
Gigascience ; 12, 2022 12 28.
Article in English | MEDLINE | ID: mdl-37848619

ABSTRACT

BACKGROUND: Machine learning (ML) has gained significant attention for classifying immune states in adaptive immune receptor repertoires (AIRRs) to support the advancement of immunodiagnostics and therapeutics. Simulated data are crucial for the rigorous benchmarking of AIRR-ML methods. Existing approaches to generating synthetic benchmarking datasets result in the generation of naive repertoires missing the key feature of many shared receptor sequences (selected for common antigens) found in antigen-experienced repertoires. RESULTS: We demonstrate that a common approach to generating simulated AIRR benchmark datasets can introduce biases, which may be exploited for undesired shortcut learning by certain ML methods. To mitigate undesirable access to true signals in simulated AIRR datasets, we devised a simulation strategy (simAIRR) that constructs antigen-experienced-like repertoires with a realistic overlap of receptor sequences. simAIRR can be used for constructing AIRR-level benchmarks based on a range of assumptions (or experimental data sources) for what constitutes receptor-level immune signals. This includes the possibility of making or not making any prior assumptions regarding the similarity or commonality of immune state-associated sequences that will be used as true signals. We demonstrate the real-world realism of our proposed simulation approach by showing that basic ML strategies perform similarly on simAIRR-generated and real-world experimental AIRR datasets. CONCLUSIONS: This study sheds light on the potential shortcut learning opportunities for ML methods that can arise with the state-of-the-art way of simulating AIRR datasets. simAIRR is available as a Python package: https://github.com/KanduriC/simAIRR.


Subjects
Benchmarking , Computer Simulation
19.
iScience ; 24(10): 103192, 2021 Oct 22.
Article in English | MEDLINE | ID: mdl-34693229

ABSTRACT

Inference of germline polymorphisms in immunoglobulin genes from B cell receptor repertoires is complicated by somatic hypermutations, sequencing/PCR errors, and by varying length of reference alleles. The light chain inference is particularly challenging owing to large gene duplications and absence of D genes. We analyzed the light chain cDNA sequences from naïve B cell receptor repertoires from 100 individuals. We optimized light chain allele inference by tweaking parameters of the TIgGER functions, extending the germline reference sequences, and establishing mismatch frequency patterns at polymorphic positions to filter out false-positive candidates. We identified 48 previously unreported variants of light chain variable genes. We selected 14 variants for validation and successfully validated 11 by Sanger sequencing. Clustering of light chain 5'UTR, L-PART1, and L-PART2 revealed partial intron retention in 11 kappa and 9 lambda V alleles. Our results provide insight into germline variation in human light chain immunoglobulin loci.

20.
Front Immunol ; 12: 680687, 2021.
Article in English | MEDLINE | ID: mdl-34367141

ABSTRACT

The adaptive branch of the immune system learns pathogenic patterns and remembers them for future encounters. It does so through dynamic and diverse repertoires of T- and B-cell receptors (TCRs and BCRs, respectively). These huge immune repertoires in each individual present investigators with the challenge of extracting meaningful biological information from multi-dimensional data. The ability to embed these DNA and amino acid textual sequences in a vector-space is an important step towards developing effective analysis methods. Here we present Immune2vec, an adaptation of a natural language processing (NLP)-based embedding technique for BCR repertoire sequencing data. We validate Immune2vec on amino acid 3-gram sequences, continuing to longer BCR sequences, and finally to entire repertoires. Our work demonstrates Immune2vec to be a reliable low-dimensional representation that preserves relevant information of immune sequencing data, such as n-gram properties and IGHV gene family classification. Applying Immune2vec along with machine learning approaches to patient data exemplifies how distinct clinical conditions can be effectively stratified, indicating that the embedding space can be used for feature extraction and exploratory data analysis.
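
To make the notion of amino acid 3-grams concrete, the sketch below shows one common way of turning a sequence into 3-letter "words" before an NLP-style embedding is trained: the sequence is read in three shifted frames, each split into non-overlapping 3-grams. Whether Immune2vec tokenizes exactly this way is an assumption; the function name `shifted_ngrams` and the example string are hypothetical.

```python
def shifted_ngrams(seq: str, n: int = 3) -> list[list[str]]:
    """Split seq into non-overlapping n-grams for each of the n reading frames.

    Hypothetical helper for illustration only; trailing residues that do
    not fill a complete n-gram are dropped.
    """
    return [
        [seq[i:i + n] for i in range(offset, len(seq) - n + 1, n)]
        for offset in range(n)
    ]


# Short placeholder amino acid string (assumption, not real data).
print(shifted_ngrams("CARDYW"))
# → [['CAR', 'DYW'], ['ARD'], ['RDY']]
```

Each inner list can then be treated as a "sentence" of words by a word-embedding model.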


Subjects
Computational Biology/methods , Gene Rearrangement, B-Lymphocyte , Gene Rearrangement, T-Lymphocyte , High-Throughput Nucleotide Sequencing , Receptors, Antigen, B-Cell/genetics , Receptors, Antigen, T-Cell/genetics , Software , Algorithms , Animals , Humans , Natural Language Processing , Receptors, Antigen, B-Cell/metabolism , Receptors, Antigen, T-Cell/metabolism , Workflow